Script and Language Identification for Document Images and Scene Texts

Authors

  • Ajeet Kumar Singh
  • Jayanthi Sivaswamy
  • Avinash Sharma
  • Praveen Krishnan
Abstract

In recent times, there has been an increase in Optical Character Recognition (OCR) solutions for recognizing text in scanned document images and in scene texts captured with mobile devices. Many of these solutions work very well for an individual script or language. But in a multilingual environment such as India, where a document image or scene image may contain more than one language, these individual OCRs fail significantly. Hence, in order to recognize text in a multilingual document image or scene image, we need to manually specify the script or language of each text block. The corresponding script/language OCR is then applied to recognize the text. This manual step is what prevents us from moving toward fully automated multilingual OCRs. This thesis presents two effective solutions to automatically identify the script and language of document images and scene texts. Even though recognition problems for scene texts have been highly researched, the script identification problem in this area is relatively new. Hence, we present an approach which represents scene-text images using mid-level stroke-based features that are pooled from densely computed local features. These features are then classified into languages using an off-the-shelf classifier. This approach is efficient and requires very little labeled data for script identification. The approach has been evaluated on the recently introduced video script dataset (CVSI). We also introduce and benchmark a more challenging Indian Language Scene Text (ILST) dataset for evaluating the performance of our method. For script and language identification in documents, we investigate the utility of Recurrent Neural Networks (RNNs). These problems have been attempted in the past with representations computed from the distribution of connected components or characters (e.g. texture, n-gram) over a larger segment (a paragraph or a page).
We argue that one can predict the script or language very accurately from minimal evidence (e.g. given only a word or a line) with the help of a pre-trained RNN. We propose a simple and generic solution for the task of script and language identification without any special tuning. This approach has been verified on a large corpus of more than 15.03M words across 55K documents comprising 15 scripts and languages. The thesis aims to provide better recognition solutions in the document and scene-text space by providing two simple but effective solutions for script and language identification. The proposed algorithms can be used in multilingual settings, where the identification module first identifies the inherent script or language of an incoming document or scene text before sending it to the corresponding script/language recognition module.
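
The routing described in the last sentence of the abstract can be sketched roughly as follows. This is only an illustrative outline, not the thesis implementation: the function `recognize_multilingual`, the `identify_script` callable, and the `recognizers` mapping are hypothetical names introduced here, and a text block is treated as an opaque object.

```python
from typing import Callable, Dict, List

# Hypothetical type aliases: a "block" is any opaque object (e.g. an image crop).
Identifier = Callable[[object], str]   # block -> script/language label
Recognizer = Callable[[object], str]   # block -> recognized text


def recognize_multilingual(
    blocks: List[object],
    identify_script: Identifier,
    recognizers: Dict[str, Recognizer],
) -> List[str]:
    """Route each block to the recognizer registered for its identified script."""
    results = []
    for block in blocks:
        # Step 1: the identification module predicts the script/language label.
        script = identify_script(block)
        # Step 2: look up the script-specific recognition module.
        recognizer = recognizers.get(script)
        if recognizer is None:
            raise KeyError(f"no recognizer registered for script {script!r}")
        # Step 3: run the matching recognizer on the block.
        results.append(recognizer(block))
    return results
```

In such a setup the identifier replaces the manual per-block script annotation, while the per-script recognizers remain unchanged.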

Similar resources

E2E-MLT - an Unconstrained End-to-End Method for Multi-Language Scene Text

An end-to-end method for multi-language scene text localization, recognition and script identification is proposed. The approach is based on a set of convolutional neural nets. The method, called E2E-MLT, achieves state-of-the-art performance for both joint localization and script identification in natural images and in cropped word script identification. E2E-MLT is the first published multi-lan...

Script Identification for Document Image Retrieval: A Survey

In recent years, advances in computer technology have led to many multimedia documents being captured and stored, and hence the demand for recognizing and retrieving such documents has increased tremendously. In such an environment, the large volume of data and the variety of scripts make manual identification unworkable. In such cases the ability to automatically determine the script, and further the...

Density Based Script Identification of a Multilingual Document Image

The field of automatic pattern recognition has witnessed enormous growth in the past few decades. As an essential element of pattern recognition, document image analysis is the procedure of analyzing a document image with the intention of working out its contents so that they can be manipulated as required at various levels. It involves various procedures like document classification, o...

Neural network based system for script identification in Indian documents

The paper describes a neural network-based script identification system which can be used in the machine reading of documents written in English, Hindi and Kannada language scripts. Script identification is a basic requirement in automation of document processing, in multi-script, multi-lingual environments. The system developed includes a feature extractor and a modular neural network. The fea...

Natural scene text localization using edge color signature

Localizing text regions in images taken from natural scenes is one of the challenging problems due to variations in font, size, color and orientation of text. In this paper, we introduce a new concept, the so-called Edge Color Signature, for localizing text regions in an image. This method is able to localize both Farsi and English texts. In the proposed method, first a pyramid using diff...

Publication date: 2015